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Abstract 

This paper is triggered by the preprint " Computing Matrix Squareroot via Non Convex Local Search" 
by Jain et al. (arXiw.1507.05854), which analyzes gradient-descent for computing the square root of a 
positive definite matrix. Contrary to claims of Jain et al. [14], our experiments reveal that Newton-like 
methods compute matrix square roots rapidly and reliably, even for highly ill-conditioned matrices 
and without requiring commutativity. We observe that gradient-descent converges very slowly 
primarily due to tiny step-sizes and ill-conditioning. We derive an alternative first-order method 
based on geodesic convexity: our method admits a transparent convergence analysis (< 1 page), 
attains linear rate, and displays reliable convergence even for rank deficient problems. Though 
superior to gradient-descent, ultimately our method is also outperformed by a well-known scaled 
Newton method. Nevertheless, the primary value of our work is its conceptual value: it shows that 
for deriving gradient based methods for the matrix square root, the manifold geometric view of positive 
definite matrices can be much more advantageous than the Euclidean view. 


1 Introduction 

Matrix functions such as A K , log A, or e A {A E C nx ”) arise in diverse contexts. Higham [9] provides an 
engaging overview on the topic of matrix functions. A building block for efficient numerical evaluation 
of various matrix functions is the matrix square root A 1 ^ 2 , a function whose computation has witnessed 
great interest in the literature [1, 2, 8, 10, 12, 20]; see also [9, Ch. 6]. 

The present paper revisits the problem of computing the matrix square root for a symmetric 
positive semidefinite (psd) matrix. 1 Our work is triggered by the recent note of [14], who analyze 
a simple gradient-descent procedure for computing A 1//2 for a given psd matrix A. Jain et al. [14] 
motivate their work by citing weaknesses of a direct application of Newton's method (Algorithm X 
in [10]), which is known to suffer from ill-conditioning and whose convergence analysis depends on 
commutativity of its iterates, and thus breaks down as round-off error invalidates commutativity. 

However, it turns out that the modified "polar-Newton" (Pn) method of [9, Alg. 6.21] avoids 
these drawbacks, and offers a theoretically sound method with excellent practical behavior. At the 
same time, gradient-descent of [14], while initially appealing due to its simplicity, turns out to be 
empirically weak: it converges extremely slowly, especially on ill-conditioned matrices because its 
stepsizes become too small to ensure progress on computers with limited floating point computation. 

At the expense of introducing "yet another matrix square root" (Yamsr) method, we present a 
non-Euclidean first-order method grounded in the geometric optimization framework of [23]. Our 
method admits a transparent convergence analysis, attains linear (geometric) convergence rate, and 
displays reliable behavior even for highly ill-conditioned problems (including rank-deficient problems). 
Although numerically superior to the gradient-descent approach of [14], like other linearly convergent 
strategies Yamsr is still outperformed by the well-known polar-Newton (scaled Newton) method. 

In light of these observations, there are two key messages delivered by our paper: 

‘The same ideas extend trivially to the Hermitian psd case. 
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► None of the currently known (to us) first-order methods are competitive with the established 

second-order polar-Newton algorithm for computing matrix square roots. 2 

► The actual value of this paper is conceptual: it shows that for deriving gradient based methods that 

compute matrix square roots, the differential geometric view of positive definite matrices is much more 

advantageous than the Euclidean view. 

As a further attestation to the latter claim, we note that our analysis yields as a byproduct a convergence 
proof of a (Euclidean) iteration due to Ando [2], which previously required a more intricate analysis. 
Moreover, we also obtain a geometric rate of convergence. 

Finally, we note that the above conceptual view underlies several other applications too, where the 
geometric view of positive definite matrices has also proved quite useful, e.g., [7, 11, 22-26]. 

1.1 Summary of existing methods 

To place our statements in wider perspective we cite below several methods for matrix square roots. 
We refer the reader to [9, Chapters 6, 8] for more details and many more references, as well as a broader 
historical perspective. For a condensed summary, we refer the reader to the survey of Iannazzo [12] 
which covers the slightly more general problem of computing the matrix geometric mean. 

1. Eigenvectors: Compute the decomposition A — UAU* and obtain A 1 ,/2 — UA 1 ' 2 LI . While 
tempting, this can be slower for large and ill-conditioned matrices. In our case, since Ay 0 , this 
method coincides with the Schur-Cholesky method mentioned in [12]. 

2. Matrix averaging: Many methods exist that obtain the matrix geometric mean as a limit of 
simpler averaging procedures. These include scaled averaging iterations (which is essentially a 
matrix Harmonic-Arithmetic mean iteration that converges to the geometric mean) and their 
variants [12, §5.1]. In spirit, the algorithm that we propose in this paper also falls in this category. 

3. Newton-like: The classic Newton iteration of [10], whose weaknesses motivate the gradient- 
descent procedure of [14]. However, a better way to implement this iteration is to use polar 
decomposition. Here, first we compute the Cholesky factorization A — R T R, then obtain the 
square root as A 1 ,/2 = UR, where U is the unitary polar factor of R 1 . This polar factor could be 
computed using the SVD of R, but the polar-Newton (Pn) iteration [9, Alg. 6.21] avoids that and 
instead uses the scaled Newton method 

Lffc+l — 2 {f l k^k + If T ) / Uo — R. 

This iteration has the benefit that the number of steps needed to attain a desired tolerance 
can be predicted in advance, and it can be implemented to be backward stable. It turns out 
that empirically this iteration is works reliably even for extremely ill-conditioned matrices, and 
converges (locally) superlinearly. The scaled Halley-iteration of Nakatsukasa et al. [20] converges 
even more rapidly and can be implemented in a numerically stable manner [19]. 

4. Quadrature: Methods based on Gaussian and Gauss-Chebyshev quadrature and rational min¬ 
imax approximations to 2' 1 ,/2 are surveyed in [12, §5.4]. Among others, these methods easily 
tackle computation of general powers, logarithms, and some other matrix functions [8, 13]. 

5. First-order methods: The binomial iteration which uses (I — C) 1 ^ 2 = I — ffj a ;C / for suitable 
otj > 0 and applies when p(C) < 1 [9, §6.8.1]. The gradient-descent method of [14] which 
minimizes ||X 2 — A || 2 . Like the binomial iteration, this method does not depend on linear 
system solves (or matrix inversions) and uses only matrix multiplication. The method introduced 
in section 2.1 is also a first-order method, though cast in a non-Euclidean space; it does, however, 
require (Cholesky based) linear system solves. 

Presumably, the same conclusion also holds for many other matrix functions such as powers and logarithms. 
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We mention other related work in passing: fast computation of a matrix C such that AP « CC T for 
SDD matrices [6]; Krylov-subspace methods and randomized algorithms for computing the product 
f(A)b [9, Chapter 13]; and the broader topic of multivariate matrix means [3, 5, 15, 17, 21, 22]. 


2 Geometric Optimization 

We present details of computing the matrix square root using geometric optimization. Specifically we 
cast the problem of computing matrix square roots into the nonconvex optimization problem 

min Sl(X,A)+6 2 s (X, t), (2.1) 

whose unique solution is the desired matrix square root X* — ' 2 ; here S 2 denotes the S-Divergen.ee [22]: 

5 2 s {X,Y) := logdet (^) - \ logdet(XY). (2.2) 

To present our algorithm for solving (2.1) let us first recall some background. 

The crucial property that helps us solve (2.1) is geodesic convexity of S 2 , that is, convexity along 
geodesics in P". Consider thus, the geodesic (see [4, Ch. 6] for a proof that this is the geodesic): 

X# f Y := X 1/2 (X" 1/2 YX” 1/2 ) f X 1/2 , X,Y^ 0 , fe[ 0 ,l], (2.3) 

that joins X to Y. A function / : P” —t IR is called geodesically convex (g-convex) if it satisfies 

/(X#fY) < (1 — t)f (X) + tf (Y). (2.4) 

Theorem 1. The S-Divergence (2.2) is jointly geodesically convex on (Hermitian) psd matrices. 

Proof. See [22, Theorem 4.4]; we include a proof in the appendix for the reader's convenience. □ 

Similar to Euclidean convexity, g-convexity bestows the crucial property: "local ==> global." 
Consequently, we may use any algorithm that ensures local optimality; g-convexity will imply global 
optimality. We present one such algorithm in section 2.1 below. 

But first let us see why solving (2.1) yields the desired matrix square root. 

Theorem 2. Let A, B >- 0 . Then, 

A# 1/2 B = argmin x ^ 0 <S 2 (X, A) + < 5 §(X, B). (2.5) 

Moreover, A#i/ 2 B is equidistant from A and B, i.e., S 2 (A,A#i/ 2 B) — S 2 (B, A#i/ 2 B). 

Proof. See [22, Theorem 4.1]. We include a proof below for convenience. 

Since the constraint set is open, we can simply differentiate the objective in (2.5), and set the 
derivative to zero to obtain the necessary condition 



X” 1 = (X + A)- 1 + (X + B)" 1 
(X + A)X _1 (X + B) — 2 X + A + B 
B = XA -1 X. 


The latter is a Riccati equation whose unique positive solution is X = A#i/ 2 B —see [4, Prop 1.2.13] 3 . 
Global optimality of this solution follows easily since S 2 is g-convex as per Thm. 1. □ 

Corollary 3. The unique solution to (2.1) is X* = A# 1 / 2 I — A 1,/2 . 

3 There is a typo in the cited result, in that it uses A and A 1 ^ 2 where it should use A -1 and A -1 ^ 2 . 
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2.1 Algorithm for matrix square root 

We present now an iteration for solving (2.1). As for Thm. 2, we obtain here the optimality condition 



X” 1 = (X + A) -1 + (X + I) -1 . 

( 

Our algorithm solves (2.6) simply by running 4 



X 0 = i(A + I) 

(2.7) 


X m <- [(X fc + A)” 1 + (X fc + I)" 1 ]" 1 , k = 0,1 . 

(2.8) 


The following facts about our algorithm and its analysis are noteworthy: 


1. Towards completion of this article we discovered that iteration (2.8) has been known in matrix 
analysis since over three decades! Indeed, motivated by electrical resistance networks, Ando 
[2] analyzed exactly the same iteration as (2.8) in 1981. Our convergence proof (Theorem 4) is 
much shorter and more transparent—it not only reveals the geometry behind its convergence 
but also yields explicit bounds on the convergence rate, thus offering a new understanding of 
the classical work of Ando [2]. 

2. Initialization (2.7) is just one of many. Any matrix in the psd interval [ 2(7 + A" 1 )^ 1 , \{A + I)] is 
equally valid; different choices can lead to better empirical convergence. 

3. Each iteration of the algorithm involves three matrix inverses. Theoretically, this costs "just" 
0 (n w ) operations. In practice, we compute (2.8) using linear system solves; in Matlab notation: 

R = (X+A)\I + (X+I)\I; X = R\I; 

4. The gradient-descent method of [14] and the binomial method [9] do not require solving linear 
systems, and rely purely on matrix multiplication. But both turn out to be slower than (2.8) 
while also being more sensitive to ill-conditioning. 

5. For ill-conditioned matrices, it is better to iterate (2.8) with al, for a suitable scalar ct > 0 ; the 
final solution is recovered by dowscaling by a _ 1//2 . A heuristic choice is a — tr (A)/ y/n, which 
seems to work well in practice (for well-conditioned matrices oc — 1 is preferable). 

Theorem 4 (Convergence). Let {X^}j.> 0 be generated by (2.y)-(i.8). Let X* — A 1 ^ 2 be the optimal solution 
to (2.1). Then, there exists a constant 7 < 1 such that 3 2 (X^,X*) < y k S 2 (Xo,X*). Moreover, limXj- = X*. 

Proof. Our proof is a specialization of the fixed-point theory in [18, 23]. Specifically, we prove that (2.8) 
is a fixed-point iteration under the Thompson part metric 

< 5 T (X,Y):=||log(X- 1/2 YX- 1 / 2 )||, (2.9) 

where || ■ || is the usual operator norm. The metric (2.9) satisfies many remarkable properties; for our 
purpose, we need the following three well-known properties (see e.g., [23, Prop. 4.2]): 

<St(X -1 ,Y -1 ) = 6 t (X,Y), S t (X + A,Y + B) <ma x{ 3 T {X,Y), 3 T (A,B)}, 

S T (X + A,Y + A)< ^-^ 5 t{X, Y), a = max{ ||X||, || Y||}. (2 ‘ 10) 

Consider now the nonlinear map 

Q = X [(X + A)- 1 + (X + 7 ) -1 ] -1 , 

4 Observe that the same algorithm also computes A^ipB if we replace I by B in (2.7) and (2.8). 
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corresponding to iteration (2.8). Using properties (2.10) of the Thompson metric Sj we have 

W(X),S(Y)) - «J T ([(X + A)" 1 + (X + I)' 1 }- 1 , [(Y + A)' 1 + (Y + I)" 1 ]" 1 ) 

- S T ((X + A)- 1 + (X + I)-\ (Y + A)- 1 + (Y + I)- 1 ) 

< max{<S r ((X + A)" 1 , (Y + A)" 1 ),< 5 T ((X + I)~\ (Y + t)” 1 )} 

= ma x{S T (X + A,Y + A),6 T (X + I,Y + I)} 

< max{7 1( 5 r (X / Y) / 7 2( 5 T (X / Y)} / 71 = fl+A ^ n(A) /?2 = a = max{||X||, ||Y||}, 

< j6t(X,Y), 7 < 1, 

where we can choose 7 to be independent of X and Y. Thus, the map Q is a strict contraction. Hence, 
from the Banach contraction theorem it follows that X*) converges at a linear rate given by 7, 

and that X k —> X*. Notice that Q maps the (compact) interval [ 2 (t + A^ 1 ) -1 , j(A + I)] to itself. Thus, 
a. j ||I + A ||; since is increasing, we can easily upper-bound 71,72 to finally obtain 

1 + 11A11 

7 1 + || A|| + 2 min(A min (A),l)' 

This is strictly smaller than 1 if A y 0 , thus yielding an explicit contraction rate. □ 

Remark 1: The bound on 7 above is a worst-case bound. Empirically, the value of a in the bound is 
usually much smaller and the convergence rate commensurately faster. 

Remark 2: Starting with Xo = j (A + I) ensures that Xo >- 0 ; thus, we can use the iteration (2.8) even if 
A is semidefinite, as all intermediate iterates remain well-defined. This suggests why iteration (2.8) is 
empirically robust against ill-conditioning. 

Example 1. Suppose A = 0 . Then, the map Q is no longer contractive but still nonexpansive. 
Iterating (2.8) generates the sequence {X A -} = \ \l, 1 1 , yyA I ,...}, which converges to zero. 

Example 1 shows that iteration (2.8) remains well-defined even for the zero matrix. This prompts 
us to take a closer look at computing square roots of low-rank semidefinite matrices. In particular, 
we extend the definition of the S-Divergence to low-rank matrices. Let A be a rank -r semidefinite 
matrix of size n x n where r < n. Then, define det, (A) := ITJ , A;(A) to be product of its r positive 
eigenvalues. Using this, we can extend (2.2) as 

s l,A X ' Y ) : = logdet,. - 2logdetr(X) - jlogdet r (Y), (2.11) 

for rank -r SPD matrices X and Y. If rank(X) 7^ rank(Y), we set Ss,r{X, Y) — +00. The above definition 
can also be obtained as a limiting form of (2.2) by applying to rank deficient X and Y by considering 
X + el and Y + el and letting e —> 0 . 

Although iteration (2.8) works remarkably well in practice, it is a question of future interest on 
how to obtain square roots for rank deficient matrices faster by exploiting (2.11) instead. 


3 Numerical results 

We present numerical results comparing running times and accuracy attained by: (i) Yamsr; (ii) 
Gd; (iii) LsGd; and (iv) Pn. These methods respectively refer to iteration (2.8), the fixed step-size 
gradient-descent procedure of [14], our line-search based implementation of gradient-descent, and the 
polar-Newton iteration of [9, Alg. 6.21]. 

We present only one experiment with Gd, because with fixed steps it vastly underperforms all 
other methods. In particular, if we set the step size according to the theoretical bounds of [14], then 
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Relative error (Frobenius norm) Relative error (Frobenius norm) 


for matrices with large condition numbers the step size becomes smaller than machine precision! In 
our experiments with Gd we used step sizes much larger than theoretical ones, otherwise the method 
makes practically no progress at all. More realistically, if we wish to use gradient-descent we must 
employ line-search. Ultimately, although substantially superior to plain Gd, even LsGd turns out to 
be outperformed by Yamsr, which in turn is superseded by Pn. 

We experiment with the following matrices: 

i.I + fill LI' for a low-rank matrix U and a variable constant j6. These matrices are well-conditioned. 

2. Random Correlation matrices (Matlab: gallery(’randcorr ’, n)); medium conditioned. 

3. The Hilbert matrix (Matlab: tiilb(n)). This is a well-known ill-conditioned matrix class. 

4. The inverse Hilbert matrix (Matlab: invhilb(n)). The entries of the inverse Hilbert matrix are 
very large integers. Extremely ill-conditioned. 



Figure 1: Running time comparison between Gd and LsGd; this behavior is typical. 



50 x 50 matrix I + f>UU T 
K « 64 


50 x 50 Hilbert matrix 

K W 2.8 X 10 18 


100 x 100 inverse Hilbert matrix 

KK 9.1 X 10 96 


500 x 500 Correlation 

k « 832 


Figure 2: Running time comparison: Yamsr vs LsGd; this behavior is typical (k is condition number). 



50 x 50 matrix I + f>UU T 


Running time (seconds) 

100 x 100 Hilbert matrix 




150 x 150 inverse Hilbert matrix 


500 x 500 Correlation 


Figure 3: Running time comparison between Yamsr and Pn; this behavior is also typical. 
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Figure 4: Square root computation of a 500 x 500 low-rank (rank 50) covariance matrix. 


4 Conclusions 

We revisited computation of the matrix square root, and compared the recent gradient-descent 
procedure of [14] against the standard polar-Newton method of [9, Alg. 6.21], as well as Yamsr, a 
new first-order fixed-point algorithm that we derived using the viewpoint of geometric optimization. 
The experimental results show that the polar-Newton method is the clear winner for computing 
matrix square roots (except near the 10 ~ 15 accuracy level for well-conditioned matrices). Among the 
first-order methods Yamsr outperforms both gradient-descent as well its superior line-search variant 
across all tested settings, including singular matrices. 
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A Technical details 

Proof of Theorem 1. It suffices to prove midpoint convexity; the general case follows by continuity. 
Consider therefore psd matrices Xi,X 2 , Y\, Y 2 . We need to show that 

^(Xi# 1 /2 X 2 ,Y 1 # 1/2 Y 2 ) < \ 5 g{X\,Y\) + ^!(X 2 / Y 2 ). (A.i) 


From the joint concavity of the operator #] / 2 [16] we know that 

Xi# 1/2 Xi + Y!# 1/2 Y 2 A (Xi + Yi)# 1/2 (X 2 + Y 2 ). 

Since log det is a monotonic function, applying it to this inequality yields 

logdet[X 1 # 1/2 X 1 + Y!# 1/2 Y 2 ] < logdet[(X 1 + Y 1 )# 1/2 (X 2 + Y 2 )]. (A.2) 

But we also know that log det ( (1 — t)X#fY) — (1 — f) log det(X) +1 log det (Y). Thus, a brief algebraic 
manipulation of (A.2) combined with the definition (2.2) yields inequality (A.i) as desired. □ 
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